This course is about open data science, and its purpose is to get familiar with open data, open science and data science. If you are interested in these topics, here is some additional information: open data, open access and data science.
We are going to use the R language for statistical computing; see R as a tool. It is also worth mentioning that the whole course can be completed on the web. That is why it is called a MOOC.
You can find my GitHub repository here.
In this section I’m going to analyze a data set of students who participated in the course Introduction to Social Statistics at the University of Helsinki in the fall of 2014. These 183 students were asked to take part in a survey about their learning approaches, consisting of deep, strategic and surface oriented learning. In addition, the students were asked to evaluate their attitude towards statistics.
You can find some basic information about the variables here
I’m going to use part of the above mentioned data, which still contains the different learning orientations and the attitude towards statistics. In addition, the data variables include the students’ age, gender and course exam points.
Here you can find the data set that I’m going to use in this analysis; it is the same address as in the code below.
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")
Next I’m going to explore what kind of material the students2014 data set contains.
As you can see below, there are 166 observations (i.e. students) and 7 variables. This suggests that not all participants of the course took part in this survey, as there were initially 183 students. The available data variables contain the age and gender of the students; gender is coded F for female and M for male.
The attitude and learning orientations were measured on a Likert scale from 1 to 5. The attitude towards statistics combines 10 questions, the deep orientation 12, the strategic orientation 8 and the surface orientation 12 questions, each scored on the 1 to 5 range. Last but not least, the data contains the exam points of the course, i.e. the exam results.
Under the structure table you can see that there are no missing values (FALSE = 1162). This is useful information considering the regression modeling: we don’t have to run a “Missing Value Analysis”.
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
table(is.na(students2014))
##
## FALSE
## 1162
Underneath you can see the summaries and histograms of the variables. There was no point in printing a histogram of gender, since gender is not a continuous variable. Instead, I was curious to see whether there is any variation between the genders on our variables, so I drew boxplots to illustrate this. Additionally, I printed normal QQ-plots to examine the normality of the distributions, as normality is one of the assumptions of linear regression. Click to see additional information about the assumptions of linear regression.
The data contains 110 (66 %) women and 56 (34 %) men. The students’ ages range between 17 and 55, with a mean of around 25 and a median of 22. From the boxplot you can see that the average age of the women is lower than that of the men. The histogram of age shows some skewness in the distribution, which the QQ-plot demonstrates as well.
As mentioned earlier, the attitude towards statistics and the learning orientations were measured on a Likert scale from 1 to 5, and the variables combine multiple items. Such variables are more likely to be normally distributed because of the central limit theorem. You can notice here too that all of these scores are quite normally distributed.
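Just to illustrate the idea (this is not course data, and I’m simplifying by assuming ten independent, uniformly distributed answers), a tiny simulation sketch shows how the mean of several Likert items is far more bell-shaped than a single item:
# Illustrative simulation only: one uniform 1-5 item vs. the mean of 10 items
set.seed(2014)
one_item <- sample(1:5, 1000, replace = TRUE)
item_means <- replicate(1000, mean(sample(1:5, 10, replace = TRUE)))
par(mfrow = c(1, 2))
hist(one_item, main = "Single Likert item")
hist(item_means, main = "Mean of 10 items")
par(mfrow = c(1, 1))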
If you look at the attitude towards statistics, you can notice that the mean is around 3. Interestingly, the male students seem to have a somewhat higher mean than the female students, but this doesn’t affect the overall mean much because the group of men is notably smaller.
When examining the learning orientations, you can see that the deep learning orientation seems to be predominant, with a mean around 3.7, whereas the surface orientation seems to be lower, with an overall mean around 2.8. There doesn’t seem to be much difference in deep orientation between the genders, even if the boxplot shows a slightly higher mean for males than for females. However, we can’t be sure about the statistical significance of that difference without testing it. The boxplot of surface orientation, in contrast, implies that there could be some differences between the genders: the mean of surface orientation seems to be lower among men, but the deviation seems to be larger compared with the women. Then something about the strategic learning approach. The summary below shows an overall mean around 3.1, implying that it is more popular than surface orientation but less popular than deep orientation. Surprisingly (for me), the strategic learning approach seems to be more popular among female than male students.
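If we wanted to test such a difference, a two-sample t-test would be one option. A small sketch (this test is not part of the original analysis) for the surface orientation:
# Welch two-sample t-test: does surface orientation differ by gender?
t.test(surf ~ gender, data = students2014)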
The exam points vary to some extent. They range between 7 and 33, with a mean around 23. Check also the boxplot, which implies that men have received higher points than women, even though the sample means don’t deviate prominently. The histogram of points makes me wonder whether we can assume normality, but the QQ-plot suggests that the deviation isn’t that remarkable.
SUMMARY
summary(students2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
STUDENTS’ AGE
library(ggplot2)
qplot(students2014$age, geom = "histogram", binwidth = 1, main = "Histogram for students' age", xlab = "age", col = I("grey"), fill = I("chartreuse3"))
qqnorm(students2014$age)
qplot(gender, age, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and age")
STUDENTS’ ATTITUDE TOWARDS STATISTICS
qplot(students2014$attitude, geom = "histogram", binwidth = 0.3, main = "Histogram for attitudes towards statistics", xlab = "attitude", col = I("grey"), fill = I("orange"))
qqnorm(students2014$attitude)
qplot(gender, attitude, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and attitude")
DEEP LEARNING ORIENTATION
qplot(students2014$deep, geom = "histogram", binwidth = 0.3, main = "Histogram for deep learning orientation", xlab = "deep", col = I("grey"), fill = I("mediumorchid"))
qqnorm(students2014$deep)
qplot(gender, deep, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and deep orientation")
STRATEGIC LEARNING ORIENTATION
qplot(students2014$stra, geom = "histogram", binwidth = 0.3, main = "Histogram for strategic learning orientation", xlab = "strategic", col = I("grey"), fill = I("blue"))
qqnorm(students2014$stra)
qplot(gender, stra, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and strategic orientation")
SURFACE LEARNING ORIENTATION
qplot(students2014$surf, geom = "histogram", binwidth = 0.3, main = "Histogram for surface learning orientation", xlab = "surface", col = I("grey"), fill = I("coral"))
qqnorm(students2014$surf)
qplot(gender, surf, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and surface orientation")
EXAM POINTS
qplot(students2014$points, geom = "histogram", binwidth = 1, main = "Histogram for exam points", xlab = "Exam points", col = I("grey"), fill = I("yellow"))
qqnorm(students2014$points)
qplot(gender, points, data = students2014, geom = "boxplot", fill = gender, main = "Boxplot of gender and points")
SCATTER PLOTS FOR EXPLORING LINEAR RELATIONSHIPS
Next I’m going to print some scatter plots to see whether there could be any linear relationships between the variables explored above, linearity being another assumption of linear regression. Later in this diary I intend to formulate a regression model to explain the exam points in particular, so I’m going to concentrate on those variables which could have a linear relationship with the exam points.
First you can see the scatter plot matrix with every variable pair in the data (excluding gender). It seems that there could be some relationship between attitude and exam points. Because of this, I printed a scatter plot where you can take a closer look at these variables, coloured by gender. Indeed, the plot implies that a linear relationship could exist between the two variables. We’ll come back to this in the next section.
The pair matrix suggests that, to some extent, there could also be a linear relationship between strategic learning and exam points. But when you take a closer look at the separate scatter plot, the dots seem to vary more than they appeared to in the smaller picture (the pair matrix). This is why I wanted to print those separate plots.
In the last matrix you can see a set of scatter plots including the distributions of and the correlations between the different variables. The matrix shows, for example, that the correlation between attitude and exam points is .437, whereas the correlation between strategic learning and points is only .146. Thus we have more evidence that attitude could be a better variable than strategic learning for explaining the course exam points. We will also see this in the next section.
library(ggplot2)
library(GGally)
pairs(students2014[-1], col = students2014$gender, main = "Scatter plot matrix")
attitude <- students2014$attitude
points <- students2014$points
gender <- students2014$gender
ggplot(students2014, aes(x = attitude, y = points, col = gender)) + geom_point() + xlab("Attitudes towards statistics") + ylab("Exam points") + ggtitle("Attitudes vs. Exam points") + geom_smooth(method = "lm")
stra <- students2014$stra
ggplot(students2014, aes(x = stra, y = points, col = gender)) + geom_point() + xlab("Strategic learning orientation") + ylab("Exam points") + ggtitle("Strategic learning vs. Exam points") + geom_smooth(method = "lm")
library(ggplot2)
library(GGally)
ggpairs(students2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20))) + ggtitle("Matrix of scatter plots and distributions")
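The same correlations can also be computed directly. A quick sketch which should reproduce the values mentioned above:
# Correlations of the learning variables and attitude with exam points
round(cor(students2014[, c("attitude", "stra", "surf")], students2014$points), 3)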
Now I’m going to choose three variables to explain the exam points and fit a linear regression model. I’m choosing the variables based on the highest correlation values with respect to exam points. The explanatory variables will be attitude (r = .437), strategic learning (r = .146) and surface approach (r = -.144). Note that the earlier examination of correlations suggested that the linear relationship between strategic learning and exam points might be negligible.
REGRESSION MODEL 1
Below you can see that my overall model seems to be significant as a result of the F-test: F(3,162) = 14.13, p < .001. However, there are problems with the explanatory variables stra (strategic learning) and surf (surface learning), whose p-values are above .05. This means I have to exclude these variables and either choose other variables from the data set or run the model with attitude only, since it is the only significant variable at this point (p < .001).
I am too curious not to try other variables to explain the exam points, even if the correlation matrix suggested that there might not be any significant variables left. Let’s take a look at MODEL 2.
Regression_model_1 <- lm(points ~ attitude + stra + surf, data = students2014)
summary(Regression_model_1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
REGRESSION MODEL 2
Beneath you can see that my hunch was correct, because of the high p-values of age and deep learning orientation (in both cases p > .05). It seems that I have to run one more model, with attitude as the only explanatory variable. So let’s move on to the third model.
Regression_model_2 <- lm(points ~ attitude + age + deep, data = students2014)
summary(Regression_model_2)
##
## Call:
## lm(formula = points ~ attitude + age + deep, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## attitude 3.59413 0.56959 6.310 2.56e-09 ***
## age -0.07716 0.05322 -1.450 0.149
## deep -0.60275 0.75031 -0.803 0.423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
REGRESSION MODEL 3
This seems better :).
First we can see some values of the residuals, which describe the differences between the actual and the predicted values of the exam points, i.e. y - y(hat). We are going to take a closer look at the residuals after the next chapter.
Our overall model seems to be significant as a result of the F-test: F(1,164) = 38.61, p < .001. Attitude is significant as a predictor, i.e. as an explanatory variable: B = 3.5, p < .001. So if we wanted to make statistical predictions with our model, we would adjust the formula y(hat) = a + Bx to y(hat) = 11.64 + 3.5x. Note that a denotes the intercept here.
Additionally, we can see that attitude explains around 19 % of the variation in the course exam points. We can interpret this with the multiple R-squared instead of the adjusted coefficient because we have only one predictor in our model.
In our model we don’t have to worry about multicollinearity since we have only one predictor.
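As a side note, for the earlier multi-predictor models one common way to check multicollinearity would be the variance inflation factor. A sketch, assuming the car package is installed (it is not used elsewhere in this diary):
# Variance inflation factors for MODEL 1; values near 1 indicate little collinearity
library(car)
vif(Regression_model_1)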
Regression_model_3 <- lm(points ~ attitude, data = students2014)
summary(Regression_model_3)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
MODEL PARAMETERS
I interpreted the model parameters briefly in the last chapter. Here the purpose is to interpret more closely the relationship between attitude and exam points, including the multiple R-squared.
A linear regression model with one predictor variable can be expressed with the following equation: Y = B0 + BX + e. This is the same equation I used earlier in the form y(hat) = a + Bx, without the residual error e, which is an unmeasured variable. So considering our model parameters, we have B0 (or a), which is the Y-intercept. Then we have the B-coefficient (beta), which defines the slope of our regression line.
Then we can take a look at the coefficients in the above summary (model 3). Our intercept has a value of around 11.6, which is the value we would predict for the exam points (Y) if the attitude towards statistics (x) were 0. The estimated regression coefficient (B) of attitude is around 3.5, meaning that if attitude differed by one unit, the exam points would differ by 3.5 units on average.
Have a look at the scatter plot below. With our equation and estimated coefficients we can express the following example: if a student had evaluated their attitude towards statistics to be 3, we would predict their exam points to be 11.6 + 3*3.5 = 22.1 on average.
ggplot(students2014, aes(x = attitude, y = points)) + geom_point() + xlab("Attitudes towards statistics") + ylab("Exam points") + ggtitle("Attitudes vs. Exam points") + geom_smooth(method = "lm")
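The same prediction can be computed with predict(). A small sketch using the fitted model from above:
# Predicted exam points for a student with attitude 3
predict(Regression_model_3, newdata = data.frame(attitude = 3))
# with the unrounded coefficients: 11.6372 + 3.5255 * 3 = 22.21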
MULTIPLE R-SQUARED
After evaluating the parameters, we still have the multiple R-squared to explore. Just to remind you: in MODEL 3, attitude explained around 19 % of the variation in the course exam points.
We can interpret the R-squared as a statistical measure of how close the data are to the fitted regression line. In this light our coefficient is somewhat moderate. We can perceive the same thing in the scatter plot above, where the observations do not all fall on the regression line but are instead spread around it. In general, the higher the R-squared, the better the model fits the data.
However, it is noteworthy that a low R-squared value doesn’t automatically mean that our model is bad. For example, in the field of human behaviour R-squared values can be quite low, since there may be many other factors affecting our actions.
There are also some limitations to the R-squared. For example, it cannot determine whether the coefficient estimates and the resulting predictions are biased. This is the reason why we explore the residual plots in the next section.
Finally I’m going to explore the validity of the model assumptions with graphical outputs including “Residuals vs. Fitted values”, “Normal QQ-plot” and “Residuals vs Leverage”.
First we take a look at the figure called Residuals vs. Fitted (i.e. predicted) values. Ideally this plot doesn’t show any discernible patterns, and the residuals should be centered on zero throughout the range of fitted values. The red line helps to spot any distracting trends. In the first plot we can see no major patterns or trends, suggesting that we can use our regression model.
Before making any final conclusions, let’s take a look at the normal QQ-plot of the residuals. Here we are trying to determine whether the residuals are sufficiently normally distributed; they should follow the line as closely as possible. Based on the graph there seems to be some deviation from the line, but I think we can assume that the residuals are still sufficiently normally distributed.
Finally we are going to analyse the plot called Residuals vs. Leverage, which helps to determine whether there are influential outliers in our regression model. First, considering the scale, we can see that there are no outlying values in the upper right or lower right corner. This suggests that there are no influential cases pulling on our regression line, so we can assume that the leverage is low. On the other hand, we have to check whether any cases fall outside the Cook’s distance lines. It seems that some of our cases, with large negative standardized residuals, are close to this line, which means that some of them could be influential outliers with respect to our regression line.
plot(Regression_model_3, which = c(1, 2, 5))
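If we wanted to identify those potentially influential cases, the Cook’s distances can also be extracted directly. A sketch using one common (but by no means the only) rule of thumb, a cutoff of 4/n:
# Largest Cook's distances and the cases above the 4/n rule of thumb
cd <- cooks.distance(Regression_model_3)
head(sort(cd, decreasing = TRUE), 3)
which(cd > 4 / length(cd))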
The correlation matrices showed that there could be a relationship between the attitude towards statistics and the course exam points. Other relationships did not seem as plausible. This became evident after running the three different regression models, where MODEL 3 was the only one in which every predictor was significant. The final model suggests that the attitude towards statistics can explain around 19 % of the variation in the course exam points, implying that there might be other meaningful factors that better explain the exam results.
The graphical outputs of the model validation suggest that there are no major problems with the residual diagnostics, except for some cases which could be influential in terms of Cook’s distance.
I have to say that this journey has been challenging but fun. I’m very eager to learn more, especially about R coding. If you had asked me some years ago whether I would ever do some coding, I would have said no. But here we are, thanks to the new open data science course and our fantastic docent and assistants. So thank you guys.
The purpose of this chapter is to get familiar with the logistic regression.
I’m going to use a dataset on student alcohol consumption from the UCI Machine Learning Repository. The purpose of the original data has been to predict secondary school students’ alcohol consumption in Portugal. The material contains two separate datasets, from math and Portuguese language courses, but note that the questionnaires were identical. There are also students who answered both questionnaires.
=====================================
You can find more detailed information about those two datasets here.
Data reference: Using Data Mining To Predict Secondary School Student Alcohol Consumption. Fabio Pagnotta, Hossain Mohammad Amran. Department of Computer Science, University of Camerino.
======================================
In this analysis the purpose is to use a combined dataset containing those students who answered both questionnaires, i.e. students who participated both in the math and the Portuguese language courses. To combine the two datasets we have utilized some of the background questions, including:
======================================
You can find the dataset which is utilized in this analysis here.
======================================
Underneath you can see all the variables of the combined dataset, consisting of 382 observations and 35 variables.
The following adjustments have also been made: the variables not used for joining the two datasets have been combined by averaging (including the grade variables). “alc_use” is the average of “Dalc” (workday alcohol consumption) and “Walc” (weekend alcohol consumption), where the scale ranges from 1 (very low) to 5 (very high). “high_use” is TRUE if “alc_use” is higher than 2.
joined_alcohol <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = TRUE)
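If my reading of the adjustments above is correct, the derived variables can be verified from the data itself. A quick sketch which should return TRUE twice:
# Checking that alc_use and high_use match their stated definitions
all(joined_alcohol$alc_use == (joined_alcohol$Dalc + joined_alcohol$Walc) / 2)
all(joined_alcohol$high_use == (joined_alcohol$alc_use > 2))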
NAMES OF DATA VARIABLES
colnames(joined_alcohol)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Underneath you can take a closer look at the content of the data variables. The variables which I haven’t introduced yet include:
STRUCTURE OF THE DATA VARIABLES
str(joined_alcohol)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
The purpose of this analysis is to study the relationships between high/low alcohol consumption and 4 variables of my own interest in the data. I think there are many interesting variables which might have a relationship with alcohol consumption.
HYPOTHESIS 1
The first variable will be sex, i.e. female or male. I believe that gender has a relationship with alcohol consumption, but it is impossible to say which gender has a higher probability of being a “high user” or a “low user”. One of the reasons is that I’m unfamiliar with the culture of Portugal and its habits of alcohol consumption.
HYPOTHESIS 2
The second interesting variable will be the weekly study time (studytime), which has the possible outcomes 1 (less than 2 hours), 2 (2 to 5 hours), 3 (5 to 10 hours) and 4 (more than 10 hours).
I believe that weekly study time and alcohol consumption are related in such a way that the less a student studies, the higher the probability of being an alcohol “high user”, and vice versa.
HYPOTHESIS 3
The third variable I’m going to choose is goout, i.e. going out with friends. This variable has possible values from 1 (very low) to 5 (very high).
The variable of going out with friends doesn’t tell us what kind of activities it consists of, but I believe that the more one goes out, the higher the probability of using alcohol. That is, the more a student goes out, the greater the probability of being an alcohol “high user”, and vice versa.
HYPOTHESIS 4
The last variable which interests me is famrel, the quality of family relationships, which ranges from 1 (very bad) to 5 (excellent).
I believe there can be a relationship between family relationships and alcohol consumption in such a way that the worse these relationships are, the higher the probability of being a “high user” of alcohol. Of course, in the bigger picture, another relevant aspect could be the overall importance of these family relationships, which can determine whether they affect the student’s alcohol consumption at all.
In this section I’m going to explore the distributions of the variables which were introduced in the previous chapter. The second purpose is to explore the possible relationships between alcohol consumption and the chosen variables.
library(tidyr); library(dplyr); library(ggplot2); library(knitr)
SEX AND ALCOHOL CONSUMPTION.
As you can see from the table below, there are 198 females and 184 males in the dataset, so the groups are of nearly the same size. In total there are 382 students in the dataset.
sex <- joined_alcohol$sex
alc_use <- joined_alcohol$alc_use
high_use <- joined_alcohol$high_use
joined_alcohol %>% count(sex) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 2 × 3
## sex n freq
## <fctr> <int> <dbl>
## 1 F 198 51.83246
## 2 M 184 48.16754
The table below shows the number of students who have evaluated their alcohol consumption as lower than or equal to 2, or as higher than 2. In the former case high_use is FALSE and in the latter case it is TRUE. Remember that the scale varies from 1 (very low) to 5 (very high).
The table also compares the alcohol consumption between the genders, and it suggests that there are somewhat more “high users” among the men (n = 71) than the women (n = 41). Conversely, there are somewhat more “low users” among the women (n = 157) than the men (n = 113). These results suggest that there could be a relationship between sex and alcohol consumption.
joined_alcohol %>% count(sex, high_use) %>% mutate(freq = n / sum(n = 382) * 100)
## Source: local data frame [4 x 4]
## Groups: sex [2]
##
## sex high_use n freq
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 157 41.09948
## 2 F TRUE 41 10.73298
## 3 M FALSE 113 29.58115
## 4 M TRUE 71 18.58639
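One way to probe this association more formally would be a chi-squared test of independence. A sketch (this test is not part of the original analysis):
# Chi-squared test of independence between sex and high alcohol use
chisq.test(table(joined_alcohol$sex, joined_alcohol$high_use))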
The first bar chart illustrates the alcohol consumption between the genders, whereas the second chart clarifies the high use of alcohol between the genders. Thus you can see the same results as described above.
g1 <- ggplot(data = joined_alcohol, aes(x = alc_use, fill = sex))
g1 + geom_bar() + ggtitle("Alcohol consumption between genders")
g2 <- ggplot(data = joined_alcohol, aes(x = high_use, fill = sex))
g2 + geom_bar() + ggtitle("Alcohol high use between genders")
WEEKLY STUDYTIME AND ALCOHOL CONSUMPTION.
The first 3 tables illustrate the weekly study time in general and between the genders. Recall the measurement scale, which ranges from 1 to 4 as described in hypothesis 2.
The first table describes the distribution of study time among the students; the most common study time seems to be 2 to 5 hours per week. The same emphasis can be seen in the second table, which indicates that the mean study time is around 2 (2 to 5 hours per week).
The third table splits the mean study time by gender and suggests that the females may spend more time studying than the males.
studytime <- joined_alcohol$studytime
joined_alcohol %>% count(studytime) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 4 × 3
## studytime n freq
## <int> <int> <dbl>
## 1 1 103 26.963351
## 2 2 190 49.738220
## 3 3 62 16.230366
## 4 4 27 7.068063
summary(studytime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.034 2.000 4.000
joined_alcohol %>% group_by(sex) %>% summarise(count = n(), mean = mean(studytime))
## # A tibble: 2 × 3
## sex count mean
## <fctr> <int> <dbl>
## 1 F 198 2.272727
## 2 M 184 1.777174
As before, the table below shows the number of students for whom the high use of alcohol is FALSE (consumption at most 2) or TRUE (consumption higher than 2).
The table compares the high use of alcohol between the different amounts of study time. If you look at the frequencies (%), which are adjusted to count the percentages inside each study time category, you can perceive that the frequencies of high use are somewhat lower in categories 3 and 4 and somewhat higher in categories 1 and 2. I think this may correspond with my hypothesis 2, where I suggested that the less a student studies, the higher the probability of being a high user of alcohol.
joined_alcohol %>% count(studytime, high_use) %>% mutate(freq = n / sum(n) * 100)
## Source: local data frame [8 x 4]
## Groups: studytime [4]
##
## studytime high_use n freq
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 60 58.25243
## 2 1 TRUE 43 41.74757
## 3 2 FALSE 133 70.00000
## 4 2 TRUE 57 30.00000
## 5 3 FALSE 54 87.09677
## 6 3 TRUE 8 12.90323
## 7 4 FALSE 23 85.18519
## 8 4 TRUE 4 14.81481
The boxplot below compares high alcohol consumption with the weekly study time between the genders.
g2 <- ggplot(data = joined_alcohol, aes(x = high_use, y = studytime, col = sex))
g2 + geom_boxplot() + ggtitle("High use consumption vs weekly studytime")
GOING OUT WITH FRIENDS AND ALCOHOL CONSUMPTION.
The first 3 tables illustrate the variable of going out with friends. The measurement scale varies from 1 (very low) to 5 (very high), and I assume it measures how often the student goes out.
The first table describes the distribution of going out among the students, and the summary table indicates that the mean of going out is around 3.
The third table splits the mean of going out by gender and suggests that there might not be any difference between the genders.
goout <- joined_alcohol$goout
joined_alcohol %>% count(goout) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 5 × 3
## goout n freq
## <int> <int> <dbl>
## 1 1 24 6.282723
## 2 2 99 25.916230
## 3 3 123 32.198953
## 4 4 82 21.465969
## 5 5 54 14.136126
summary(goout)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.113 4.000 5.000
joined_alcohol %>% group_by(sex) %>% summarise(count = n(), mean = mean(goout))
## # A tibble: 2 × 3
## sex count mean
## <fctr> <int> <dbl>
## 1 F 198 3.045455
## 2 M 184 3.184783
As before, the table below shows the number of students for whom the high use of alcohol is FALSE or TRUE.
The table compares the high use of alcohol between the different frequencies of going out with friends (i.e. not going out often vs. going out often). If you look at the frequency column (%), which is adjusted to count the percentages inside each going out category, you can perceive that the relative frequencies of alcohol high use grow the more a student goes out with her/his friends. This may correspond with my hypothesis 3, where I suggested that the more a student goes out, the greater the probability of being an alcohol high user, and vice versa.
joined_alcohol %>% count(goout, high_use) %>% mutate(freq = n / sum(n) * 100)
## Source: local data frame [10 x 4]
## Groups: goout [5]
##
## goout high_use n freq
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 21 87.50000
## 2 1 TRUE 3 12.50000
## 3 2 FALSE 84 84.84848
## 4 2 TRUE 15 15.15152
## 5 3 FALSE 100 81.30081
## 6 3 TRUE 23 18.69919
## 7 4 FALSE 43 52.43902
## 8 4 TRUE 39 47.56098
## 9 5 FALSE 22 40.74074
## 10 5 TRUE 32 59.25926
The boxplot below compares high alcohol consumption with the frequency of going out with friends between the genders.
g3 <- ggplot(data = joined_alcohol, aes(x = high_use, y = goout, col = sex))
g3 + geom_boxplot() + ggtitle("High use consumption vs going out with friends")
FAMILY RELATIONSHIPS AND ALCOHOL CONSUMPTION.
The first 3 tables illustrate the quality of family relationships. This variable has been measured from 1 to 5, describing the family relationships as varying from very bad to excellent.
The first table describes the counts in each quality category, and most of the students seem to rate their family relationships as good (4 or 5). The second table indicates that the mean quality is around 4, which I interpret as rather good family relationships on average.
The third table splits the above mentioned mean by gender. These results imply that there might not be any difference between the genders in the quality of family relationships.
famrel <- joined_alcohol$famrel
joined_alcohol %>% count(famrel) %>% mutate(freq = n / sum(n) * 100)
## # A tibble: 5 × 3
## famrel n freq
## <int> <int> <dbl>
## 1 1 9 2.356021
## 2 2 18 4.712042
## 3 3 66 17.277487
## 4 4 183 47.905759
## 5 5 106 27.748691
summary(famrel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 4.00 3.94 5.00 5.00
joined_alcohol %>% group_by(sex) %>% summarise(count = n(), mean = mean(famrel))
## # A tibble: 2 × 3
## sex count mean
## <fctr> <int> <dbl>
## 1 F 198 3.878788
## 2 M 184 4.005435
As before, the table below shows the number of students for whom the high use of alcohol is FALSE or TRUE.
The table compares the high use of alcohol between the different qualities of family relationships. Again the frequencies (%) are adjusted to count the percentages inside each category, since this makes it easier to compare the relative frequencies between the quality categories.
It seems that high use is most common somewhere in the middle, between bad and average family relationships, and the extreme values (very bad and excellent) don’t seem to differ from each other. Thus these results may not correspond with my hypothesis 4, where I suggested that the worse the relationships are, the higher the probability of being an alcohol high user.
joined_alcohol %>% count(famrel, high_use) %>% mutate(freq = n / sum(n) * 100)
## Source: local data frame [10 x 4]
## Groups: famrel [5]
##
## famrel high_use n freq
## <int> <lgl> <int> <dbl>
## 1 1 FALSE 7 77.77778
## 2 1 TRUE 2 22.22222
## 3 2 FALSE 9 50.00000
## 4 2 TRUE 9 50.00000
## 5 3 FALSE 40 60.60606
## 6 3 TRUE 26 39.39394
## 7 4 FALSE 131 71.58470
## 8 4 TRUE 52 28.41530
## 9 5 FALSE 83 78.30189
## 10 5 TRUE 23 21.69811
The boxplot below compares high alcohol consumption with the quality of family relationships between the genders.
g4 <- ggplot(data = joined_alcohol, aes(x = high_use, y = famrel, col = sex))
g4 + geom_boxplot() + ggtitle("High use consumption vs family relationships")
In this chapter the purpose is to use logistic regression to find out whether the previously chosen variables are related to the high use of alcohol, and whether the hypotheses suggested earlier can be considered appropriate.
Before fitting the model I will check whether there are any missing values in the combined alcohol consumption data. The table below indicates that there are none, since FALSE = 13 370.
TABLE OF THE MISSING VALUES
table(is.na(joined_alcohol))
##
## FALSE
## 13370
SUMMARY OF THE FITTED MODEL
My model contains both a categorical variable (“sex”) and continuous variables (the others) as independent variables.
First we can see that all the coefficients of the independent variables are significant. The sex variable suggests a significant association between the sex of the student and the probability of high alcohol consumption. The positive coefficient for this predictor suggests that, all other variables being equal, a male student is more likely to be a high user of alcohol than a female student. In my first hypothesis I suggested that sex might have a relationship with alcohol consumption, but I couldn’t state whether there would be differences between the genders.
Then we have the significant variable of weekly study time, suggesting an association between study time and the probability of high alcohol use. The negative coefficient for this predictor implies that as study time increases, the probability (log odds) of being a high user of alcohol decreases, all other variables held constant. This result corresponds with my second hypothesis, where I suggested that the less a student studies, the higher the probability of being a high user, and vice versa.
The variable of going out with friends (“goout”) has the lowest p-value, suggesting a strong association between going out and the probability of being an alcohol high user. The positive coefficient for this predictor suggests that if one goes out more, the probability (log odds) of being a high user increases, all other variables held constant. This corresponds with my third hypothesis, where I suggested that the more a student goes out with her/his friends, the higher the probability of being a high user, and vice versa.
The last variable is famrel, which represents the quality of family relationships. The significance of this factor suggests an association between family relationships and the probability of being an alcohol high user. The negative coefficient implies that the better these relationships are, the lower the probability (log odds) of being a high user. Thus, somewhat surprisingly, this corresponds with my fourth hypothesis. The surprising part stems from chapter 3.3: after exploring the alcohol high use together with family relationships and reading the summaries, I had started to doubt the possible relationship between these variables.
my_model <- glm(high_use ~ sex + studytime + goout + famrel, data = joined_alcohol, family = "binomial")
summary(my_model)
##
## Call:
## glm(formula = high_use ~ sex + studytime + goout + famrel, family = "binomial",
## data = joined_alcohol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5928 -0.7494 -0.4919 0.8250 2.7080
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2112 0.7142 -1.696 0.08991 .
## sexM 0.7593 0.2655 2.860 0.00424 **
## studytime -0.4752 0.1703 -2.790 0.00527 **
## goout 0.7947 0.1218 6.523 6.89e-11 ***
## famrel -0.4496 0.1379 -3.260 0.00111 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 379.27 on 377 degrees of freedom
## AIC: 389.27
##
## Number of Fisher Scoring iterations: 4
ODDS RATIOS
Here the coefficients of the above regression model are exponentiated, and therefore we can interpret them as odds ratios (OR). Additionally, the table below contains the confidence intervals (CI) of the odds ratios.
The confidence intervals in the table below support the significance of my logistic regression results: none of the predictors’ intervals contains 1, which for an odds ratio corresponds to no effect.
OR <- coef(my_model) %>% exp
CI <- confint(my_model) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2978484 0.07158146 1.1881189
## sexM 2.1368457 1.27450527 3.6171734
## studytime 0.6217330 0.44091699 0.8617913
## goout 2.2138196 1.75581193 2.8339395
## famrel 0.6378742 0.48488876 0.8343189
PREDICTIONS
probabilities <- predict(my_model, type = "response")
joined_alcohol <- mutate(joined_alcohol, probability = probabilities)
joined_alcohol <- mutate(joined_alcohol, prediction = probability > 0.5)
table(high_use, prediction = probabilities > 0.5)
## prediction
## high_use FALSE TRUE
## FALSE 244 26
## TRUE 66 46
library(dplyr)
g5 <- ggplot(joined_alcohol, aes(x = probability, y = high_use, col = prediction))
g5 + geom_point()
table(high_use, prediction = joined_alcohol$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 26
## TRUE 66 46
table(high_use, prediction = joined_alcohol$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.63874346 0.06806283
## TRUE 0.17277487 0.12041885
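From these cross tabulations we can also compute the share of incorrect predictions, i.e. the training error. A small sketch using the prediction column created above:
# Proportion of wrongly classified individuals
mean(joined_alcohol$prediction != joined_alcohol$high_use)
# from the table above: (26 + 66) / 382 is about 0.24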
In this chapter the purpose is to get familiar with clustering and classification. I’m going to use a dataset called Boston, which contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. The data was originally published in Harrison, D. and Rubinfeld, D.L., “Hedonic prices and the demand for clean air”, J. Environ. Economics & Management, vol. 5, 81-102, 1978. You can find some additional information here and here.
This dataset comes built into R (with the MASS package), so let’s take a look at some details.
DATA VARIABLES
STRUCTURE OF THE BOSTON DATASET
From the structure table you can notice that there are 506 observations and 14 variables, including 13 continuous variables and 1 binary-valued variable, “chas”.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
In this section I’m going to explore the distributions of the data variables further. First I show you a summary table which describes the distribution of each variable in quartiles, including the minimum and maximum values.
After this I’ll show you the histograms of these variables, excluding chas and black. Chas is a dummy variable consisting of the values 1 and 0. My reason for dropping black is ethical; remember that the current dataset is already “long in the tooth”. The histograms give a better understanding of the variation of each variable.
At the end of this chapter I’ll explore the correlations between the variables and demonstrate them by means of a correlation matrix table and a correlation matrix plot.
SUMMARY TABLE OF DATA VARIABLES
From the summary table below you can perceive that the distributions vary on very different scales.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
HISTOGRAMS OF CRIM, ZN, INDUS AND NOX
library(ggplot2); library(gridExtra); library(dygraphs)
# INITIALIZING THE PLOTS
# Histogram of crim
hcrim <- qplot(Boston$crim, geom = "histogram", binwidth = 6, main = "Crime rate by town", xlab = "crime rate", col = I("grey"), fill = I("chartreuse3"))
# Histogram of zn
hzn <- qplot(Boston$zn, geom = "histogram", binwidth = 6, main = "Prop. of residential land zoned", xlab = "zn", col = I("grey"), fill = I("orange"))
# Histogram of indus
hindus <- qplot(Boston$indus, geom = "histogram", binwidth = 3, main = "Prop. of non-retail business", xlab = "indus", col = I("grey"), fill = I("mediumorchid"))
hnox <- qplot(Boston$nox, geom = "histogram", binwidth = 0.02, main = "Nitrogen oxides concentration", xlab = "nox", col = I("grey"), fill = I("blue"))
# Combining the plots using the function grid.arrange() [in gridExtra]
# Additional information from (http://www.sthda.com/english/wiki/ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page-r-software-and-data-visualization)
grid.arrange(hcrim, hzn, hindus, hnox, ncol=2, nrow =2)
HISTOGRAMS OF RM, AGE, DIS AND RAD
library(ggplot2); library(gridExtra); library(dygraphs)
# INITIALIZING THE PLOTS
hrm <- qplot(Boston$rm, geom = "histogram", binwidth = 0.5, main = "Number of rooms", xlab = "rooms", col = I("grey"), fill = I("yellow"))
hage <- qplot(Boston$age, geom = "histogram", binwidth = 10, main = "Owner-occupied units", xlab = "age of the unit", col = I("grey"), fill = I("chocolate"))
hdis <- qplot(Boston$dis, geom = "histogram", binwidth = 1, main = "distances to 5 employment centres", xlab = "distance", col = I("grey"), fill = I("cornflowerblue"))
hrad <- qplot(Boston$rad, geom = "histogram", binwidth = 1, main = "Access to radial highways", xlab = "radial highways", col = I("grey"), fill = I("coral"))
# Combining the plots using the function grid.arrange() [in gridExtra]
grid.arrange(hrm, hage, hdis, hrad, ncol=2, nrow =2)
HISTOGRAMS OF TAX, PTRATIO, LSTAT AND MEDV
library(ggplot2); library(gridExtra); library(dygraphs)
# INITIALIZING THE PLOTS
htax <- qplot(Boston$tax, geom = "histogram", binwidth = 30, main = "Property tax rate per $10,000", xlab = "tax rate", col = I("grey"), fill = I("red"))
hptratio <- qplot(Boston$ptratio, geom = "histogram", binwidth = 1, main = "Pupil-teacher ratio by town", xlab = "pupil-teacher ratio", col = I("grey"), fill = I("purple"))
hlstat <- qplot(Boston$lstat, geom = "histogram", binwidth = 2, main = "Lower status of the population %", xlab = "lower status", col = I("grey"), fill = I("palegreen4"))
hmedv <- qplot(Boston$medv, geom = "histogram", binwidth = 4, main = "Owner-occupied homes in $1000s", xlab = "Owner-occupied homes", col = I("grey"), fill = I("violetred2"))
# Combining the plots using the function grid.arrange() [in gridExtra]
grid.arrange(htax, hptratio, hlstat, hmedv, ncol=2, nrow =2)
CORRELATION MATRIX BETWEEN DATA VARIABLES
library(tidyverse)
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
PLOTTING THE CORRELATION MATRIX
I used corrplot() to better demonstrate the above correlation matrix. You can see the plot below; it is good to know that positive correlations are displayed in blue and negative correlations in red. The colour intensity and the size of each circle are proportional to the correlation coefficient, and the legend shows the coefficients corresponding to the colours. I also added the correlation values of each pair, even though the plot is not as beautiful after this.
You can perceive that the highest correlation is between rad and tax (0.91), and there are also strong correlations between nox and indus (0.76) and between tax and indus (0.72). The latter two are quite logical: nitrogen oxide levels and tax rates are higher in industrial areas.
Within the next chapters I’m going to concentrate more on the crime rate as a target variable, so it is also convenient to explore the relationships between the crime rate and the other variables here. You can see that the highest positive correlations are between crime rate and rad, as well as between crime rate and tax rate. I’m not sure what these are actually telling us. By this I mean that sometimes variables can correlate without any specific cause and effect. Here the crime rate by town correlates positively with accessibility to radial highways, but I can’t figure out any “real” reason why this happens. Maybe the criminals do their crimes in towns from which they can easily get away :D
And maybe the positive correlation between crime rate and tax rate indicates that there are more crimes in areas with bigger properties, because bigger properties have bigger taxes. Wealthier people may also live in bigger properties, which may be more tempting, for example, to thieves. I’m not sure what kinds of crimes the crime rate contains, but I think you get my point.
Then there are more moderate negative correlations between crime rate and medv, as well as between crime rate and dis. Medv indicates the median value of owner-occupied homes in $1000s, and the correlation suggests that the crime rate is higher where the median value of owner-occupied homes is lower, and vice versa. I won’t try to interpret this detached from the bigger picture, because there can be several reasons for it. Dis indicates the weighted mean of the distances to five Boston employment centres, so the negative correlation tells us that the higher the crime rate, the lower the mean distance, and vice versa. This may be logical in the sense that there can be more crimes near these centres (shorter distance) compared with the more rural areas (greater distance).
Note that under the corrplot you can find a pair (scatter plot) matrix which also illustrates the relationships between the different variables.
library(corrplot); library(dygraphs)
corrplot(cor_matrix, method = "circle", main = "Correlation matrix of Boston variables", tl.cex = 1, addCoef.col = "black")
library(ggplot2)
library(GGally)
pairs(Boston [-4], main = "Scatter plot matrix")
In this chapter I’m going to scale or standardize the whole dataset.
SUMMARIES OF THE SCALED DATA
As you can see from the table below, the distributions of the variables changed after standardizing the dataset. Now the mean of every variable is zero and the observations vary on both sides of zero (with a standard deviation of one). This makes the differently scaled distributions more comparable for the further steps.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
CATEGORIZING THE CRIME RATE
There are a lot of technicalities in this part, so I decided to describe the steps inside the R code chunks. The main point here is to create a new categorical variable based on the scaled crime rate and split it into 4 categories by means of its quantiles, labelled “low”, “med_low”, “med_high” and “high”. After this I remove the “original” scaled crime rate from the dataset.
# Converting the scaled data into a data frame
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
# Using the quantiles as the break points in the categorical variable of scaled_crim
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# Creating a categorical variable "crime"
boston_scaled$crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
crime <- boston_scaled$crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Verifying the content
colnames(boston_scaled)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
## [15] "crime"
class(boston_scaled)
## [1] "data.frame"
# Dropping the old crime rate from the dataset
library(dplyr)
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Verifying the content
colnames(boston_scaled)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
CREATING THE TRAIN AND TEST SETS
# Choosing randomly 80 % of the rows (a set.seed() call here would make the split reproducible)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
# Creating train and test sets
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# Inspecting the content
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num 1.015 1.015 -0.211 1.015 1.567 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num 0.365 1.599 0.262 0.512 0.598 ...
## $ rm : num -2.3578 0.251 -0.8221 -0.0792 0.1243 ...
## $ age : num 1.116 0.878 -0.518 0.69 1.042 ...
## $ dis : num -1.064 -0.851 -0.671 -0.876 -0.697 ...
## $ rad : num 1.66 1.66 -0.408 1.66 -0.637 ...
## $ tax : num 1.529 1.529 -0.102 1.529 0.171 ...
## $ ptratio: num 0.806 0.806 0.344 0.806 1.268 ...
## $ black : num -3.591 -3.606 0.441 0.292 0.319 ...
## $ lstat : num 3.0411 0.7558 -0.0901 0.064 -0.2147 ...
## $ medv : num -0.5037 -1.4062 -0.0797 -0.1232 0.0508 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 4 4 2 4 3 4 3 3 4 2 ...
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num -0.487 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.306 -0.437 -0.755 -0.616 -0.616 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.144 -0.481 -0.921 -0.921 ...
## $ rm : num 0.207 -1.017 -0.5 -0.105 -0.858 ...
## $ age : num -0.3508 1.0489 -0.0133 -2.2052 -1.2354 ...
## $ dis : num 1.07667 0.00136 -0.20646 0.91459 0.61991 ...
## $ rad : num -0.752 -0.637 -0.522 -0.752 -0.752 ...
## $ tax : num -1.105 -0.601 -0.767 -1.04 -1.04 ...
## $ ptratio: num 0.113 1.175 0.344 -0.257 -0.257 ...
## $ black : num 0.41 0.218 0.441 0.414 0.441 ...
## $ lstat : num -1.042 1.172 -0.416 -0.73 -0.342 ...
## $ medv : num 0.671 -0.971 -0.395 0.236 -0.352 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 3 1 2 2 2 1 1 2 2 ...
colnames(train)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
colnames(test)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "crime"
Linear discriminant analysis is a classification technique and it’s preferred over logistic regression when the target variable has more than two classes. LDA has some assumptions about the data: one of these is that the input variables are normally distributed. It also assumes that each attribute has the same variance (reference).
Note that the data was standardized earlier in terms of the latter assumption.
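A quick way to see what the scaling did for that assumption (a small check; note that equal overall variance is not quite the same as the equal within-class covariance that LDA formally assumes):
# Every numeric predictor now has unit variance; column 14 (crime) is a factor
round(sapply(boston_scaled[, -14], var), 2)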
FITTING THE LINEAR DISCRIMINANT ANALYSIS
From the table below we can see, for example, that the prior probabilities show that 24.0% of the observations in the training set belong to the low crime rate class. The output also provides the group means, which are the averages of each predictor within each class.
LDA defines as many “discriminant functions” as the number of categories of the outcome minus one. I have four categories, so the LDA output shows three functions under the headline “Proportion of trace”. These functions are ordered by the amount of variance they explain.
library(MASS)
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2475248 0.2599010 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.90206744 -0.9450350 -0.069385757 -0.8708722 0.4051783
## med_low -0.07004565 -0.3155346 0.003267949 -0.5344089 -0.1328667
## med_high -0.39658567 0.1266744 0.252617630 0.3766713 0.1591382
## high -0.48724019 1.0149946 -0.040734936 1.0302534 -0.4215012
## age dis rad tax ptratio
## low -0.8758903 0.8964919 -0.6977144 -0.7574583 -0.44041373
## med_low -0.3021401 0.3781277 -0.5351175 -0.4557878 -0.08893231
## med_high 0.4140353 -0.3691359 -0.4284195 -0.3465056 -0.27508018
## high 0.7856899 -0.8345219 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3767072 -0.77646941 0.5123954732
## med_low 0.3187098 -0.12220357 -0.0009575105
## med_high 0.1122242 0.01054713 0.1962888423
## high -0.7822006 0.88799830 -0.6787582468
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.143701937 0.63051802 -1.02824859
## indus 0.047508020 -0.21697145 0.28285913
## chas -0.090448003 -0.03186641 0.07463214
## nox 0.208188054 -0.85557170 -1.33861423
## rm -0.202570262 -0.17738989 -0.18161623
## age 0.281501194 -0.33769862 -0.01793288
## dis -0.127676920 -0.21316458 0.29043043
## rad 3.774136689 0.82780342 -0.30893098
## tax -0.001161338 0.14159124 0.79536650
## ptratio 0.168980794 -0.06082067 -0.36088087
## black -0.153249002 -0.01037768 0.11968461
## lstat 0.131132194 -0.26418191 0.36046267
## medv 0.175679874 -0.38992254 -0.18274507
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9606 0.0305 0.0090
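The “Proportion of trace” row can be recomputed from the singular values stored in the fitted model; a quick check:
# Squared singular values of the discriminant functions as shares of their sum
round(prop.table(lda.fit$svd^2), 4)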
# Creating a numeric vector of the train sets crime classes for plotting purposes
classes <- as.numeric(train$crime)
# The function for LDA biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# plotting the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.5)
LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The class that gets the highest probability is the output class and a prediction is made (reference).
First I’m going to save the four crime classes from the test set to “correct_classes” and then remove the variable from the test set, so that the prediction is made from the predictors alone while the true classes are kept aside for evaluating the model.
SAVING THE CRIME CATEGORIES
# Saving the crime classes from the test data into correct_classes
correct_classes <- test$crime
# Removing crime from the test set
library(dplyr)
test <- dplyr::select(test, -crime)
# Verifying the content of test and correct_classes
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num -0.487 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.306 -0.437 -0.755 -0.616 -0.616 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.144 -0.481 -0.921 -0.921 ...
## $ rm : num 0.207 -1.017 -0.5 -0.105 -0.858 ...
## $ age : num -0.3508 1.0489 -0.0133 -2.2052 -1.2354 ...
## $ dis : num 1.07667 0.00136 -0.20646 0.91459 0.61991 ...
## $ rad : num -0.752 -0.637 -0.522 -0.752 -0.752 ...
## $ tax : num -1.105 -0.601 -0.767 -1.04 -1.04 ...
## $ ptratio: num 0.113 1.175 0.344 -0.257 -0.257 ...
## $ black : num 0.41 0.218 0.441 0.414 0.441 ...
## $ lstat : num -1.042 1.172 -0.416 -0.73 -0.342 ...
## $ medv : num 0.671 -0.971 -0.395 0.236 -0.352 ...
colnames(test)
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv"
str(correct_classes)
## Factor w/ 4 levels "low","med_low",..: 1 3 1 2 2 2 1 1 2 2 ...
summary(test)
## zn indus chas nox
## Min. :-0.4872 Min. :-1.40908 Min. :-0.2723 Min. :-1.42991
## 1st Qu.:-0.4872 1st Qu.:-0.89307 1st Qu.:-0.2723 1st Qu.:-1.01568
## Median :-0.4872 Median :-0.16425 Median :-0.2723 Median :-0.29941
## Mean : 0.1063 Mean : 0.06266 Mean :-0.1565 Mean :-0.06589
## 3rd Qu.: 0.3703 3rd Qu.: 1.01499 3rd Qu.:-0.2723 3rd Qu.: 0.59809
## Max. : 3.5861 Max. : 2.11552 Max. : 3.6648 Max. : 2.72965
## rm age dis
## Min. :-3.055198 Min. :-2.22300 Min. :-1.242784
## 1st Qu.:-0.586570 1st Qu.:-1.05244 1st Qu.:-0.809094
## Median :-0.059968 Median : 0.23714 Median :-0.314475
## Mean : 0.002627 Mean :-0.08273 Mean :-0.008745
## 3rd Qu.: 0.626395 3rd Qu.: 0.91123 3rd Qu.: 0.701999
## Max. : 2.864100 Max. : 1.11639 Max. : 3.956602
## rad tax ptratio
## Min. :-0.98187 Min. :-1.312691 Min. :-2.70470
## 1st Qu.:-0.63733 1st Qu.:-0.784617 1st Qu.:-0.46446
## Median :-0.52248 Median :-0.143809 Median : 0.25149
## Mean :-0.03045 Mean :-0.005537 Mean :-0.01659
## 3rd Qu.: 1.20022 3rd Qu.: 1.237192 3rd Qu.: 0.80578
## Max. : 1.65960 Max. : 1.529413 Max. : 1.26768
## black lstat medv
## Min. :-3.878357 Min. :-1.36857 Min. :-1.906340
## 1st Qu.: 0.225872 1st Qu.:-0.81753 1st Qu.:-0.514598
## Median : 0.383219 Median :-0.24479 Median :-0.221027
## Mean :-0.004026 Mean :-0.04064 Mean :-0.009643
## 3rd Qu.: 0.433387 3rd Qu.: 0.49565 3rd Qu.: 0.284567
## Max. : 0.440616 Max. : 3.40663 Max. : 2.986505
PREDICTING CLASSES WITH TEST DATA
In this part I’m going to predict the classes with the test data. In addition to our exercises in DataCamp I used the actual confusionMatrix() command to illustrate the results more clearly, which is why the table of predicted values actually appears twice. I kept the first table because I wanted to be sure that I had managed to code the confusion matrix correctly. (A small caveat: in caret the first argument of confusionMatrix() is treated as the predictions and the second as the reference, so with confusionMatrix(correct_classes, lda.pred$class) the roles are swapped relative to my cross table; the accuracy is unaffected, but the per-class statistics should be read with this in mind.)
From the confusion matrix we can see that the estimated accuracy of the final model is 68.6%. You can see the same result in the table where the correct classes are set against the predicted classes. And if you like, you can compute the accuracy by hand: the sum of correctly predicted values divided by the total number of observations, here (18 + 14 + 14 + 24) / 102 ≈ 0.686.
# Predicting classes with the test data
lda.pred <- predict(lda.fit, newdata = test)
lda.pred$class
## [1] med_low med_high med_low low med_low med_low low
## [8] low med_low med_low med_low med_low med_low med_low
## [15] med_low med_low low low med_low low low
## [22] med_high med_high med_high med_low med_low med_low med_high
## [29] med_high med_high med_high med_high med_high med_high med_high
## [36] med_high med_high med_high med_high med_high med_high med_low
## [43] med_high low low low low low med_low
## [50] med_low low med_low med_low med_high med_high med_high
## [57] low low low low low low low
## [64] low med_low med_low low med_low med_low med_low
## [71] med_low med_low med_low low low high high
## [78] high high high high high high high
## [85] high high high high high high high
## [92] high high high high high high high
## [99] high high high med_high
## Levels: low med_low med_high high
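Behind the predicted classes, predict() also returns the posterior class probabilities; a quick look at the first rows:
# The class with the highest posterior probability becomes the prediction
round(head(lda.pred$posterior), 3)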
# Cross tabulating the result
table1 <- table(correct = correct_classes, predicted = lda.pred$class)
table1
## predicted
## correct low med_low med_high high
## low 18 10 2 0
## med_low 6 14 6 0
## med_high 0 5 14 2
## high 0 0 1 24
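The same accuracy can be computed directly from the cross table; a one-line sketch:
# Correctly predicted values (diagonal) divided by the total number of observations
sum(diag(table1)) / sum(table1) # ~0.686 for the run shown above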
# Some of these packages may be redundant; different sources suggested different packages and R then asked for further dependencies
library(caret); library(lattice); library(mlbench); library(e1071)
confusionMatrix(correct_classes, lda.pred$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low med_low med_high high
## low 18 10 2 0
## med_low 6 14 6 0
## med_high 0 5 14 2
## high 0 0 1 24
##
## Overall Statistics
##
## Accuracy : 0.6863
## 95% CI : (0.5869, 0.7745)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5814
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: low Class: med_low Class: med_high Class: high
## Sensitivity 0.7500 0.4828 0.6087 0.9231
## Specificity 0.8462 0.8356 0.9114 0.9868
## Pos Pred Value 0.6000 0.5385 0.6667 0.9600
## Neg Pred Value 0.9167 0.8026 0.8889 0.9740
## Prevalence 0.2353 0.2843 0.2255 0.2549
## Detection Rate 0.1765 0.1373 0.1373 0.2353
## Detection Prevalence 0.2941 0.2549 0.2059 0.2451
## Balanced Accuracy 0.7981 0.6592 0.7600 0.9550
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a dataset. When we cluster the observations of a dataset, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.
In this chapter I’m going to reload the Boston dataset for the clustering purposes. First I’m going to scale the variables to get comparable distances between the observations. After this I’m going to run the k-means algorithm and investigate the optimal number of clusters.
K-means clustering is an unsupervised learning algorithm, which means that there is no outcome to be predicted; instead the algorithm tries to find patterns in the data.
RELOADING THE BOSTON DATASET
reboston <- Boston
str(reboston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
SUMMARY OF NOT SCALED BOSTON DATASET
summary(reboston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
SUMMARY OF SCALED BOSTON DATASET
reboston_scaled <- scale(reboston)
summary(reboston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
CALCULATING THE DISTANCES
Here I’m going to calculate the Euclidean distance, which is a very common ordinary or straight-line distance measure. It is used for measuring the similarity or dissimilarity of objects.
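In formula form, for two observations \(x\) and \(y\) measured on \(p\) variables, \(d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}\).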
# Euclidean distance
dist_eu <- dist(reboston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
K-MEANS AND INITIAL AMOUNT OF CLUSTERS
K-means is one of the best-known clustering approaches, which is applied to partition the observations into a pre-specified number of clusters.
In this section I present some k-means pair plots based on the Euclidean distance matrix with a couple of arbitrary numbers of centers.
# Note: passing a dist object to kmeans() clusters the rows of the 506 x 506
# distance matrix, i.e. each town is represented by its distances to all others
km_eu <- kmeans(dist_eu, centers = 2)
pairs(reboston_scaled, col = km_eu$cluster, main = "Pairs with 2 clusters")
km_eu2 <- kmeans(dist_eu, centers = 4)
pairs(reboston_scaled, col = km_eu2$cluster, main = "Pairs with 4 clusters")
DETERMINING THE NUMBER OF CLUSTERS
Here the purpose is to find the optimal number of clusters. The idea behind k-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible. There are different methods to achieve this goal and I’m going to use the total within-cluster sum of squares, WCSS, which can be calculated with the formula below:
\(WCSS = \sum_{i=1}^{N} (x_i - \text{centroid})^2\)
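In R, kmeans() reports this quantity directly as tot.withinss; a tiny illustrative check (km_check is just a throwaway name here):
# tot.withinss is the WCSS summed over all clusters
km_check <- kmeans(reboston_scaled, centers = 2, nstart = 10)
km_check$tot.withinss == sum(km_check$withinss) # TRUE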
The code below is adopted from the DataCamp exercises as such. The task is to find an appropriate number of clusters. The DataCamp guidelines advised choosing the number of clusters where the total WCSS drops radically. So if you look at the figure below, the most radical drop happens after the first number, suggesting that the optimal amount could be 2 clusters.
On the other hand, statistical textbooks (e.g. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf) suggest that the clustering results may be optimal when the within-cluster variation is as small as possible.
I thought I could somehow combine the above advice, and this is what I concluded: the second “cluster point” doesn’t reflect the smallest variation even though there is a radical change after the first point. It seems that there are some drops between the second and the sixth point, but between the fifth and the sixth the within-cluster variation doesn’t seem to be so different anymore. After this you may already guess that I’m suggesting 5 clusters. But I wouldn’t say that this has to be the final decision without using other tools and methods.
set.seed(123) # keeping the results same
k_max <- 10 # maximum number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b', main = "Number of proposed clusters")
USING THE ELBOW METHOD TO DETERMINE THE NUMBER OF CLUSTERS
Then there is the so-called Elbow Method, which also uses the within-cluster variance. You can find the method and algorithm here. On this site the author suggests that “the location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters”.
Below you can see two plots using the Elbow Method. I have to say that I’m not sure anymore what the “correct” number of clusters could be. Here I would rather say 2 than 5. Note that I drew the dashed lines just to compare the results.
But I’m not happy yet with the results. So let’s take a look at one more method after this.
set.seed(123)
# Computing and plotting WSS for k = 1 to k = 10
k.max <- 10 # maximal number of clusters
data <- reboston_scaled
wss <- sapply(1:k.max,
              function(k){kmeans(data, k, nstart = 10)$tot.withinss})
plot(1:k.max, wss,
     type = "b", pch = 1, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares", main = "The amount of clusters")
abline(v = 2, lty = 2)
# Another function, same method
library(factoextra); library(ggplot2)
fviz_nbclust(reboston_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2)
USING THE “ALL VALIDATION INDICES” TO DETERMINE THE NUMBER OF CLUSTERS
This method is also adopted from the website I introduced above. It can be a “heavy” method as it computes up to 30 indices for determining the optimal number of clusters (see ?NbClust for additional information). Note that the index = "all" option actually drops some of the slowest indices for the sake of faster computing.
According to the majority rule over these indices, the best number of clusters is 2.
library(NbClust)
nb <- NbClust(reboston_scaled, distance = "euclidean", min.nc = 2,
              max.nc = 10, method = "complete", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 6 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
nb
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW
## 2 365.8858 138.3799 30.2810 -10.6105 546.752 3.282035e+33 379348.85
## 3 0.0051 88.3119 68.0689 -16.2831 1135.949 2.304751e+33 344603.04
## 4 0.4190 89.3535 157.2726 -14.9359 1713.232 1.309260e+33 251173.09
## 5 3.3113 127.0749 59.2562 -0.9093 2860.839 2.117710e+32 120863.36
## 6 4.9882 125.2845 20.4741 3.8642 3531.911 8.095706e+31 97956.82
## 7 0.6595 111.8671 24.1067 4.0150 3820.466 6.229953e+31 89795.14
## 8 0.2506 103.7538 73.4329 5.0250 4307.836 3.105734e+31 85970.73
## 9 121.0692 113.1228 7.0030 13.6910 4924.178 1.162705e+31 59638.61
## 10 0.1218 102.5418 11.2077 12.4447 5033.461 1.156612e+31 58526.58
## TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2
## 2 5546.998 11.6512 1.2746 0.3501 1.8755 0.1927 0.9033 32.2200
## 3 5232.616 15.3777 1.3511 0.4039 1.5318 0.1870 0.7283 74.9817
## 4 4608.911 20.8047 1.5340 0.4425 1.6959 0.1720 0.6174 182.7857
## 5 3509.433 45.3481 2.0146 0.3677 1.7257 0.2152 0.6582 51.4077
## 6 3138.254 49.5877 2.2528 0.3991 1.4939 0.2251 0.7893 26.6922
## 7 3014.803 52.4833 2.3451 0.4023 1.5854 0.1980 0.7194 28.4728
## 8 2875.870 60.0652 2.4584 0.4049 1.4818 0.2014 0.6649 90.2336
## 9 2506.302 67.5566 2.8209 0.3635 1.4674 0.2147 0.4475 4.9376
## 10 2471.477 68.5392 2.8606 0.3873 1.3914 0.2126 0.7514 7.9406
## Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert
## 2 1.0250 0.3193 2773.4992 0.3052 -0.7151 0.7265 0.0476 2e-04
## 3 3.5661 0.2869 1744.2052 0.3539 0.4543 0.7378 0.0563 3e-04
## 4 5.9326 0.2875 1152.2277 0.3723 0.1353 0.9946 0.0641 3e-04
## 5 4.9388 0.3104 701.8866 0.4607 -0.0405 1.7679 0.0644 4e-04
## 6 2.5390 0.2997 523.0423 0.4851 0.7911 1.8311 0.0731 5e-04
## 7 3.6965 0.2821 430.6862 0.4754 -0.0785 1.9927 0.0747 5e-04
## 8 4.8160 0.2683 359.4838 0.4819 0.2051 1.9994 0.0761 5e-04
## 9 9.4871 0.2641 278.4780 0.4892 -0.0289 2.4723 0.0766 5e-04
## 10 3.0515 0.2518 247.1477 0.4893 0.0608 2.4725 0.0815 5e-04
## SDindex Dindex SDbw
## 2 1.4091 3.1092 0.9516
## 3 1.6118 3.0485 1.1166
## 4 1.6525 2.8619 1.0039
## 5 1.7602 2.4429 0.9631
## 6 1.5711 2.3173 0.8411
## 7 1.6287 2.2698 0.7837
## 8 1.6017 2.2312 0.8377
## 9 1.5912 2.0942 0.7618
## 10 1.4766 2.0818 0.6648
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.8871 38.3134 0.4243
## 3 0.8721 29.4697 0.0000
## 4 0.8864 37.8040 0.0000
## 5 0.8377 19.1787 0.0000
## 6 0.8383 19.2906 0.0013
## 7 0.8190 16.1362 0.0000
## 8 0.8673 27.3952 0.0000
## 9 0.4753 4.4164 0.0000
## 10 0.7243 9.1356 0.0002
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 2.0000 2.0000 5.0000 9.000 5.000 5.000000e+00
## Value_Index 365.8858 138.3799 98.0165 13.691 1147.607 9.666747e+32
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 5.0 5.0000 5.0000 9.0000 2.0000 10.0000
## Value_Index 130309.7 728.2982 24.5433 -0.3228 0.3501 1.3914
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 6.0000 2.0000 2.00 2.000 2.0000 3.000
## Value_Index 0.2251 0.9033 32.22 1.025 0.3193 1029.294
## PtBiserial Frey McClain Dunn Hubert SDindex Dindex
## Number_clusters 10.0000 1 2.0000 10.0000 0 2.0000 0
## Value_Index 0.4893 NA 0.7265 0.0815 0 1.4091 0
## SDbw
## Number_clusters 10.0000
## Value_Index 0.6648
##
## $Best.partition
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 1 1
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 1
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 2 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2 2 2
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 2 2 2 2 2 2 2 2 1 2 1 1 2 1 2 2 1 1
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 2 2 1 2 2 2 2 2 2 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2
## 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
## 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1
## 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## 1 1 1 2 2 1 2 1 1 2 2 2 1 1 2 2 2 2
## 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342
## 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1
## 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378
## 2 2 2 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2
## 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 505 506
## 2 2
# Visualizing the results
fviz_nbclust(nb) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 6 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
Here you can take another look at the pair matrix with two clusters.
# Plotting the pairs with two centers again
pairs(reboston_scaled, col = km_eu$cluster, main = "Pairs with 2 clusters")
In this chapter the purpose is to get familiar with some dimensionality reduction techniques, including Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA).
Principal Component Analysis is a way of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. After finding the patterns, PCA provides tools for compressing the data by reducing the number of dimensions without much loss of information. Additional info on PCA
Multiple Correspondence Analysis is used for the same purposes as PCA but it can be utilized when the data contains categorical variables. Additional info on MCA
In this chapter I will be working with Human Development data adopted from the United Nations Development Programme, UNDP. This programme has explored human development around the world with specific indices every year since 1990, Country List in 2015. The indices used here are the Human Development Index (HDI) and the Gender Inequality Index (GII).
I’m going to work with the HDI and GII indices by combining the datasets and using specific variables: SFM_edu, FMlabour, Eedu, Elife, GNIpc, mort_ratio, ad_birth and seats.
Note 1: SFM_edu is a ratio of education where values less than 1 signify that males are more educated than females and values greater than 1 imply that females are more educated than males.
Note 2: FMlabour is a ratio of labour force participation where values less than 1 signify that males participate more than females and values greater than 1 imply that females participate more than males.
Note 3: the variables from the Human Development Index data are Eedu, Elife and GNIpc, while the other variables are from the Gender Inequality Index data.
For additional information you can find here the Human Development Index data and its components.
For additional information you can find here the Gender Inequality Index data and its components.
COMBINED DATASET
You can find the combined dataset in my GitHub repository: https://github.com/8475394/IODS-project/blob/master/data/human_1.txt
library(tidyr); library(readr)
human_own <- read.table(file.path("C:/Users/heidi/Documents/YLIOPISTO/TILASTOTIEDE/INTRODUCTION TO OPEN DATA SCIENCE/IODS-project/data", "human_1.txt"), header = TRUE)
STRUCTURE OF THE HUMAN DATASET
From the structure table you can notice that there are 155 observations and 8 variables. Note that the country is actually a row name, so it doesn’t appear in the variable list.
str(human_own)
## 'data.frame': 155 obs. of 8 variables:
## $ SFM_edu : num 1.007 0.997 0.983 0.989 0.969 ...
## $ FMlabour : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Eedu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Elife : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNIpc : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ mort_ratio: int 4 6 6 5 6 7 9 28 11 8 ...
## $ ad_birth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ seats : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
In this section I’m going to explore the distributions of the data variables even further. First I show you a summary table which contains the distribution of each variable, where the variables are divided into quartiles including the minimum and maximum values.
After this I’ll show you a set of histograms and explore the potential relationships between the variables.
SUMMARY TABLE OF DATA VARIABLES
You may see some interesting “phenomena” in the summary table. Both the ratio of education (SFM_edu) and the ratio of labour force participation (FMlabour) suggest that males are more dominant in these fields. It seems that females are closer to equality in education than in labour force participation. You can examine this visually in the histograms later in this chapter.
According to the technical notes of the Human Development indices, the scale of expected years of schooling per country (Eedu) varies between 0 and 18, where the latter is equivalent to achieving a master’s degree in most countries. However, you can see that the actual maximum of the schooling distribution is a bit over 20. If I understood correctly, the range 0-18 is used when computing the overall Human Development Index. Nevertheless, I think the average of expected schooling years is quite good even though these values don’t tell us anything about specific countries.
I can’t help wondering about the minimum value of expected life years at birth per country (Elife). 49 years seems very low even though the mean and median values are much greater.
I think the distribution of Gross National Income per capita emphasizes the huge distinctions between countries and their wealth. Note that GNI per capita is measured in purchasing power parity terms, which is useful for making comparisons between countries, PPP. However, there is a “but” when it comes to measuring the Human Development Index: if you look at the table of HDI (as seen earlier), you can see that GNI is not necessarily the most critical measure in terms of well-being. For example, the GNI of Montenegro is lower than average but its Human Development Index is very high. Note also that this was additional information, since the actual HD index is not a part of this examination.
Based on the technical notes, the maternal mortality ratio represents maternal health: the higher the ratio, the poorer the maternal health. If you look at the table of GII (as seen earlier), you can see that the maximum value of maternal mortality among countries with very high human development is 69. This is just to give you some sense of the scale.
The adolescent birth rate represents the rate of births per 1000 women between ages 15 and 19. This scale also varies a lot. However, the distribution is skewed to the right (positive skewness), suggesting that there are more low than high values on this scale.
The distribution of seats represents the share (%) of seats in parliament held by women in each country. Here you can see, for example, that it is not common to have more females than males in parliaments, as the median and mean values are both close to 20 %. There are also just a few countries where women are in the majority.
library(knitr)
kable(summary(human_own), format = "pandoc", digits = 2, align="l")
| SFM_edu | FMlabour | Eedu | Elife | GNIpc | mort_ratio | ad_birth | seats | |
|---|---|---|---|---|---|---|---|---|
| Min. :0.1717 | Min. :0.1857 | Min. : 5.40 | Min. :49.00 | Min. : 581 | Min. : 1.0 | Min. : 0.60 | Min. : 0.00 | |
| 1st Qu.:0.7264 | 1st Qu.:0.5984 | 1st Qu.:11.25 | 1st Qu.:66.30 | 1st Qu.: 4198 | 1st Qu.: 11.5 | 1st Qu.: 12.65 | 1st Qu.:12.40 | |
| Median :0.9375 | Median :0.7535 | Median :13.50 | Median :74.20 | Median : 12040 | Median : 49.0 | Median : 33.60 | Median :19.30 | |
| Mean :0.8529 | Mean :0.7074 | Mean :13.18 | Mean :71.65 | Mean : 17628 | Mean : 149.1 | Mean : 47.16 | Mean :20.91 | |
| 3rd Qu.:0.9968 | 3rd Qu.:0.8535 | 3rd Qu.:15.20 | 3rd Qu.:77.25 | 3rd Qu.: 24512 | 3rd Qu.: 190.0 | 3rd Qu.: 71.95 | 3rd Qu.:27.95 | |
| Max. :1.4967 | Max. :1.0380 | Max. :20.20 | Max. :83.50 | Max. :123124 | Max. :1100.0 | Max. :204.80 | Max. :57.50 |
HISTOGRAMS OF EDUCATION, LABOUR FORCE, YEARS OF SCHOOLING AND LIFESPAN
library(ggplot2); library(gridExtra);library(dygraphs)
# INITIALIZING THE PLOTS
h1 <- qplot(human_own$SFM_edu, geom = "histogram", binwidth = 0.1, main = "Population with at least sec. school", xlab = "ratio of education, female/male", ylab = "number of countries", col = I("grey"), fill = I("chartreuse3"))
h2 <- qplot(human_own$FMlabour, geom = "histogram", binwidth = 0.08, main = "Labour force participation rate", xlab = "ratio of labour force, female/male", ylab = "number of countries", col = I("grey"), fill = I("orange"))
h3 <- qplot(human_own$Eedu, geom = "histogram", binwidth = 1.2, main = "Expected years of schooling per country", xlab = "years per country", ylab = "number of countries", col = I("grey"), fill = I("mediumorchid"))
h4 <- qplot(human_own$Elife, geom = "histogram", binwidth = 4, main = "Expected life years at birth per country", xlab = "years per country", ylab = "number of countries", col = I("grey"), fill = I("blue"))
# Combining the plots using the function grid.arrange()
grid.arrange(h1, h2, h3, h4, ncol=2, nrow =2)
HISTOGRAMS OF GNI, MATERNAL MORTALITY, ADOLESCENT BIRTH RATE AND SHARE OF SEATS
library(ggplot2); library(gridExtra); library(dygraphs)
# INITIALIZING THE PLOTS
h5 <- qplot(human_own$GNIpc, geom = "histogram", binwidth = 10000, main = "Gross National Income, GNI", xlab = "GNI per capita PPP", ylab = "number of countries", col = I("grey"), fill = I("red"))
h6 <- qplot(human_own$mort_ratio, geom = "histogram", binwidth = 50, main = "Maternal mortality ratio", xlab = "deaths per 100 000 live births", ylab = "number of countries", col = I("grey"), fill = I("cornflowerblue"))
h7 <- qplot(human_own$ad_birth, geom = "histogram", binwidth = 10, main = "Adolescent birth rate", xlab = "births per 1000 women ages 15-19", ylab = "number of countries", col = I("grey"), fill = I("yellow"))
h8 <- qplot(human_own$seats, geom = "histogram", binwidth = 3, main = "Share of seats in parliament", xlab = "% held by women", ylab = "number of countries", col = I("grey"), fill = I("violetred2"))
# Combining the plots using the function grid.arrange()
grid.arrange(h5, h6, h7, h8, ncol=2, nrow =2)
INITIAL RESULTS OF THE RELATIONSHIPS
Here you can have an initial look at the relationships between human data variables. I added a regression line into each scatterplot to better perceive the direction of the relationships.
NOTE: According to James, Witten, Hastie & Tibshirani (2013, 375), these two-dimensional scatterplots are most likely not very informative when we want to explore the total information in the data. This is the case especially if we have a lot of variables. Furthermore, these two-dimensional plots contain just a small fraction of the total information present in the dataset.
However, I will also explore the correlations between the variables before continuing to the dimensionality reduction part. The correlation matrix and plot can be seen after the initial results.
library(GGally); library(ggplot2)
g = ggpairs(human_own, lower = list(continuous = wrap("smooth", method = "lm", color="red")), upper = list(continuous = wrap("density", color = "blue")))
g
CORRELATION MATRIX BETWEEN DATA VARIABLES
Here I’m going to explore more closely the linear relationships between the variables to get a better understanding of the data. You can see the correlation coefficients of each pair in the table below. It is interesting that there is almost zero correlation between SFM_edu (ratio of education, F/M) and FMlabour (ratio of labour force, F/M). Initially I thought that there might be a linear relationship between these variables at least to some extent. However, SFM_edu has some other interesting relationships, for example with the maternal mortality ratio (mort_ratio) and the adolescent birth rate (ad_birth). These relationships are negative, suggesting that when the ratio of female education is high, the maternal mortality and adolescent birth ratios are respectively low. You can see the same logic also between expected years of schooling per country (Eedu) and mort_ratio/ad_birth, as well as between expected life years at birth per country (Elife) and mort_ratio/ad_birth.
Interestingly, Gross National Income per capita (GNIpc) has a positive relationship with expected years of schooling per country (Eedu) and expected life years at birth per country (Elife). It may be a natural consequence that if a country is wealthy, it can invest more in education. But I think there might be some other components affecting the lifespan of a population, even though there is a positive correlation between GNI and life years.
If you look at the “seats” column (or row), i.e. the share of seats in parliament (% held by women), there are no high correlation values, although it is related to FMlabour and Eedu to some extent. The former may suggest that when the ratio of female labour force participation is high, the share of seats in parliament held by women can also be high, and vice versa. The latter can imply that when the expected years of schooling per country are high, the share of seats in parliament held by women can also be high. Here I can imagine that if the number of schooling years in some country is high, there would also be more educated women among the population to be voted into parliament. Note that the latter point of view was just “an educated guess”!
After the correlation matrix you can see a more visual “correlation matrix”.
library(tidyverse); library(knitr)
cor_matrix <- cor(human_own) %>% round(digits = 2)
kable(cor_matrix, format = "pandoc", digits = 2, align="c")
| SFM_edu | FMlabour | Eedu | Elife | GNIpc | mort_ratio | ad_birth | seats | |
|---|---|---|---|---|---|---|---|---|
| SFM_edu | 1.00 | 0.01 | 0.59 | 0.58 | 0.43 | -0.66 | -0.53 | 0.08 |
| FMlabour | 0.01 | 1.00 | 0.05 | -0.14 | -0.02 | 0.24 | 0.12 | 0.25 |
| Eedu | 0.59 | 0.05 | 1.00 | 0.79 | 0.62 | -0.74 | -0.70 | 0.21 |
| Elife | 0.58 | -0.14 | 0.79 | 1.00 | 0.63 | -0.86 | -0.73 | 0.17 |
| GNIpc | 0.43 | -0.02 | 0.62 | 0.63 | 1.00 | -0.50 | -0.56 | 0.09 |
| mort_ratio | -0.66 | 0.24 | -0.74 | -0.86 | -0.50 | 1.00 | 0.76 | -0.09 |
| ad_birth | -0.53 | 0.12 | -0.70 | -0.73 | -0.56 | 0.76 | 1.00 | -0.07 |
| seats | 0.08 | 0.25 | 0.21 | 0.17 | 0.09 | -0.09 | -0.07 | 1.00 |
PLOTTING THE CORRELATION MATRIX
library(corrplot)
corrplot(cor_matrix, method = "circle", tl.cex = 0.8, addCoef.col = "black")
When we have a large set of correlated variables, Principal Component Analysis allows us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in our dataset. PCA is an unsupervised learning algorithm, which means that there is no outcome to be predicted; instead it tries to find patterns in the data. This can be done through visualization, as seen in this chapter and the next.
First I’m going to perform principal component analysis (PCA) without standardizing the Human data. Note that the PCA is performed again in the next chapter but with the standardized dataset. The purpose is thus to see and interpret whether there are any differences between these models, and my guess is that the actual purpose is to show and justify why the data has to be standardized.
SUMMARY TABLE OF NON-STANDARDIZED DATA
Here you can see the same summary table of the Human dataset as in the beginning of this chapter. It is shown again for practical reasons, to make it easier to compare the results between the PCAs with the non-standardized and standardized datasets.
library(knitr)
kable(summary(human_own), format = "pandoc", digits = 2, align="l")
| SFM_edu | FMlabour | Eedu | Elife | GNIpc | mort_ratio | ad_birth | seats | |
|---|---|---|---|---|---|---|---|---|
| Min. :0.1717 | Min. :0.1857 | Min. : 5.40 | Min. :49.00 | Min. : 581 | Min. : 1.0 | Min. : 0.60 | Min. : 0.00 | |
| 1st Qu.:0.7264 | 1st Qu.:0.5984 | 1st Qu.:11.25 | 1st Qu.:66.30 | 1st Qu.: 4198 | 1st Qu.: 11.5 | 1st Qu.: 12.65 | 1st Qu.:12.40 | |
| Median :0.9375 | Median :0.7535 | Median :13.50 | Median :74.20 | Median : 12040 | Median : 49.0 | Median : 33.60 | Median :19.30 | |
| Mean :0.8529 | Mean :0.7074 | Mean :13.18 | Mean :71.65 | Mean : 17628 | Mean : 149.1 | Mean : 47.16 | Mean :20.91 | |
| 3rd Qu.:0.9968 | 3rd Qu.:0.8535 | 3rd Qu.:15.20 | 3rd Qu.:77.25 | 3rd Qu.: 24512 | 3rd Qu.: 190.0 | 3rd Qu.: 71.95 | 3rd Qu.:27.95 | |
| Max. :1.4967 | Max. :1.0380 | Max. :20.20 | Max. :83.50 | Max. :123124 | Max. :1100.0 | Max. :204.80 | Max. :57.50 |
PERFORMING PCA WITH SINGULAR VALUE DECOMPOSITION METHOD
There are two slightly different methods to perform the PCA: the Eigenvalue Decomposition and the Singular Value Decomposition. According to our instructions in DataCamp the latter is preferred.
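As a small side note, in R prcomp() uses the singular value decomposition while princomp() uses the eigendecomposition of the covariance matrix; an illustrative comparison (not part of the original analysis; the two differ only by the n versus n-1 variance divisor):
# Component standard deviations from the two decompositions
round(prcomp(human_own)$sdev[1:3], 2)
round(princomp(human_own)$sdev[1:3], 2)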
From the tables and biplot below you can see the results of the Principal Component Analysis with the non-standardized Human data. The results suggest that there is only one principal component in the Human data. Since PCA is based on capturing the biggest variation in the dataset, you can see that the method captures only the variance of Gross National Income (GNIpc). This is because GNIpc has by far the largest scale and the other variables are not comparable with it. Thus, we can get misleading results like this if we forget to standardize our data.
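You can see the scale problem directly from the column variances (a quick sketch):
# GNIpc's variance dwarfs the rest, so it alone drives the first principal component
round(sapply(human_own, var), 1)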
# Performing the PCA with Singular Value Decomposition (SVD) method
pca_human1 <- prcomp(human_own)
s1 <- summary(pca_human1)
s1
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# rounded percentages of variance captured by each PC
pca_pr1 <- round(100*s1$importance[2, ], digits = 1)
print(pca_pr1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# creating object pc_lab to be used as axis labels
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
pc_lab1
## [1] "PC1 (100%)" "PC2 (0%)" "PC3 (0%)" "PC4 (0%)" "PC5 (0%)"
## [6] "PC6 (0%)" "PC7 (0%)" "PC8 (0%)"
# drawing the biplot with captions
biplot(pca_human1, cex = c(0.8, 1), col = c("dimgrey", "darkorange"), xlab = "100%", ylab = "0%")
Here I provide some additional information on PCA. The purpose of PCA is to find a low-dimensional representation of a dataset that contains as much of the variation as possible. It is very important to realize that not all of these dimensions, i.e. principal components, are equally interesting. Furthermore, the first principal component captures most of the variance in the normalized (standardized) dataset. The second principal component captures the maximal variance in the dataset that is uncorrelated with the first component, and so on (James, Witten, Hastie & Tibshirani, 2013, 375-376).
SUMMARY TABLE OF THE STANDARDIZED HUMAN DATA
As you can see from the table below, the distributions of the variables changed after standardizing the dataset. Now the mean of every variable is zero and the observations vary on both sides of zero. This means that the different distributions are now more comparable for the further purposes.
# Standardizing the Human data
std_human <- scale(human_own)
library(knitr)
kable(summary(std_human), format = "pandoc", digits = 2, align="l")
| SFM_edu | FMlabour | Eedu | Elife | GNIpc | mort_ratio | ad_birth | seats | |
|---|---|---|---|---|---|---|---|---|
| Min. :-2.8189 | Min. :-2.6247 | Min. :-2.7378 | Min. :-2.7188 | Min. :-0.9193 | Min. :-0.6992 | Min. :-1.1325 | Min. :-1.8203 | |
| 1st Qu.:-0.5233 | 1st Qu.:-0.5484 | 1st Qu.:-0.6782 | 1st Qu.:-0.6425 | 1st Qu.:-0.7243 | 1st Qu.:-0.6496 | 1st Qu.:-0.8394 | 1st Qu.:-0.7409 | |
| Median : 0.3503 | Median : 0.2316 | Median : 0.1140 | Median : 0.3056 | Median :-0.3013 | Median :-0.4726 | Median :-0.3298 | Median :-0.1403 | |
| Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | |
| 3rd Qu.: 0.5958 | 3rd Qu.: 0.7350 | 3rd Qu.: 0.7126 | 3rd Qu.: 0.6717 | 3rd Qu.: 0.3712 | 3rd Qu.: 0.1932 | 3rd Qu.: 0.6030 | 3rd Qu.: 0.6127 | |
| Max. : 2.6646 | Max. : 1.6632 | Max. : 2.4730 | Max. : 1.4218 | Max. : 5.6890 | Max. : 4.4899 | Max. : 3.8344 | Max. : 3.1850 |
PERFORMING PCA WITH SINGULAR VALUE DECOMPOSITION METHOD
The tables and biplot below tell a hugely different story about the principal components in the Human data compared with the earlier findings. The tables suggest that the first principal component has captured 53.6% of the total variance in the Human data. The second component has captured 16.2% and so on. As you can see, these shares diminish quite fast, implying that the first principal component actually captures most of the variance.
# Performing the PCA with Singular Value Decomposition (SVD) method
pca_human2 <- prcomp(std_human)
s2 <- summary(pca_human2)
s2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# rounded percentages of variance captured by each PC
pca_pr2 <- round(100*s2$importance[2, ], digits = 1)
print(pca_pr2)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
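These percentages are just the squared component standard deviations as shares of the total variance; a quick check:
# Proportion of variance explained, recomputed from the sdev values
round(100 * pca_human2$sdev^2 / sum(pca_human2$sdev^2), 1)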
ILLUSTRATING AND INTERPRETING THE RESULTS WITH A BIPLOT
The principal components are illustrated in the biplot below. Just to remind you, there are 155 countries and 8 variables, so the principal component score vectors have length n=155 and the principal component loading vectors have length p=8. The figure below plots the first two principal components of the data and represents both the principal component scores and the loading vectors.
From the biplot we can see that the first principal component places approximately equal weight on mort_ratio and ad_birth, corresponding to their high reciprocal positive correlation. The first principal component also places approximately equal weight on SFM_edu, Eedu, Elife and GNIpc, which likewise corresponds with the high positive correlations between these variables. Note that each of the former variables correlates negatively with each of the latter variables. This means that these variables belong to the same principal component, i.e. they represent the same phenomenon but from opposite angles.
Now I have to think what could be the common denominator for all of the variables listed so far. I think I will call the first principal component the National Well-being Index, since the component consists of features like the level and length of education, the lifespan and the Gross National Income, combined with the maternal mortality ratio and the adolescent birth rate. Let’s take some examples from the biplot. Countries like Qatar, Australia and Norway seem to have a high index of well-being since they have high negative scores on the first component, i.e. low scores in terms of mortality and adolescent birth rate and “high opposite scores”. Accordingly, countries like Chad, Niger and Sierra Leone seem to have a low index of well-being since they have high positive scores on the first component, i.e. high scores in terms of mortality and adolescent birth rate and “low opposite scores”.
I think, in certain terms, I could compare the scale of the first principal component to Osgood’s scale, which attempts to measure polarities or contrasts, usually with opposite adjectives. But if we wanted to combine the variables, for example by summing, we could reverse the scale direction of either set of variables.
Then we can take a look at the other axis, i.e. the second principal component. From the biplot we can see that the second component places approximately equal weight on FMlabour and seats, also corresponding to their positive correlation. Both of the variables could represent the rate of females at “work”, even though a seat in parliament can be seen as a public “job”. Nevertheless, the variables could also represent some forms of equality and inequality. I think I will name the second component the National Index of Equal Employment.
Let’s take some examples from the biplot in terms of the second component. Countries like Iceland, Bolivia, Mozambique and especially Rwanda have high positive scores on our “National Index of Equal Employment” while countries like Iran, Lebanon and Yemen have very low scores.
But since the biplot is a kind of “fourfold table”, we can find more interpretations. We can start from the upper left corner, where we can find Finland, the earlier mentioned Iceland, and Denmark, i.e. the Nordic countries and some other European countries. These countries represent the ideal places considering both our indices, i.e. high in employment equality and well-being. The upper right corner is high in equality but low in well-being, e.g. Mozambique. The lower right corner is low both in well-being and equality, and there you can find for example Niger and Afghanistan. Finally we have the lower left corner, which represents countries high in well-being but low in employment equality. There are countries like Qatar, Kuwait and Saudi Arabia.
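The loading vectors behind this whole interpretation can also be printed directly:
# Loadings of the first two principal components; the signs and magnitudes
# back up the groupings discussed above
round(pca_human2$rotation[, 1:2], 2)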
# drawing the biplot again with captions
biplot(pca_human2, cex = c(0.8, 1), col = c("dimgrey", "deeppink"), xlab = "National well-being index", ylab = "National Index of Equal Employment")
Here the purpose is to get familiar with Multiple Correspondence Analysis, MCA. It is a dimensionality reduction method and it represents the data in 2- or 3-dimensional space. Here you can find additional information concerning MCA.
For the purposes of Multiple Correspondence Analysis I have to load a new dataset, since MCA is used when the data contains categorical variables and my earlier dataset contained only continuous variables.
I’m going to load the tea dataset from the FactoMineR package.
STRUCTURE OF THE TEA DATASET
As you can imagine, the data come from a questionnaire on tea consumption. There are 300 individuals who were asked how they drink tea (18 questions). Then there are supplementary variables like age and sex; note that age is the only continuous variable in the dataset. In addition, 12 questions concern the individuals' product perception and 4 questions cover personal details. You can see the whole structure in the table below.
library(FactoMineR) # https://cran.r-project.org/web/packages/FactoMineR/index.html
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
VARIABLES TO KEEP
I'm going to reduce the number of variables and choose some specific active variables, including:
The time of the day when the tea is drunk
Tea-drinking behaviour
Additionally, I'm going to call this new dataset my_tea.
library(dplyr)
# identifying the columns to keep
keep_columns <- c("breakfast", "evening", "lunch", "dinner", "Tea", "How", "sugar", "how", "where", "price")
# selecting the columns
my_tea <- dplyr::select(tea, one_of(keep_columns))
STRUCTURE OF MY_TEA DATASET
From the table below you can see that there are 300 observations, i.e. individuals, and 10 variables, which are all categorical.
str(my_tea)
## 'data.frame': 300 obs. of 10 variables:
## $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
head(my_tea)
## breakfast evening lunch dinner Tea How sugar
## 1 breakfast Not.evening Not.lunch Not.dinner black alone sugar
## 2 breakfast Not.evening Not.lunch Not.dinner black milk No.sugar
## 3 Not.breakfast evening Not.lunch dinner Earl Grey alone No.sugar
## 4 Not.breakfast Not.evening Not.lunch dinner Earl Grey alone sugar
## 5 breakfast evening Not.lunch Not.dinner Earl Grey alone No.sugar
## 6 Not.breakfast Not.evening Not.lunch dinner Earl Grey alone No.sugar
## how where price
## 1 tea bag chain store p_unknown
## 2 tea bag chain store p_variable
## 3 tea bag chain store p_variable
## 4 tea bag chain store p_variable
## 5 tea bag chain store p_variable
## 6 tea bag chain store p_private label
SUMMARY TABLE OF THE MY_TEA DATASET
From the table below you can take a closer look at the categories of the kept variables.
summary(my_tea)
## breakfast evening lunch dinner
## breakfast :144 evening :103 lunch : 44 dinner : 21
## Not.breakfast:156 Not.evening:197 Not.lunch:256 Not.dinner:279
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
## where price
## chain store :192 p_branded : 95
## chain store+tea shop: 78 p_cheap : 7
## tea shop : 30 p_private label: 21
## p_unknown : 12
## p_upscale : 53
## p_variable :112
VISUALIZING THE STRUCTURES OF DATASET
library(tidyr); library(ggplot2)
# bar plots of the categories of each kept variable
gather(my_tea) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar(fill = "darkorchid1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
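Note that gather() still works here but has since been superseded by pivot_longer() in tidyr 1.0+. As an optional variation, the same plot with the newer interface could look like this (values_transform coerces the factors to character, since their level sets differ):
my_tea %>%
  pivot_longer(everything(), names_to = "key", values_to = "value",
               values_transform = list(value = as.character)) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar(fill = "darkorchid1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))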
MULTIPLE CORRESPONDENCE ANALYSIS WITH MY_TEA DATASET
This dataset could be studied from the point of view of the individuals, the variables and/or the categories. When studying the individuals, the aim is to understand similarities between them in terms of all the variables. For example, two tea drinkers are considered similar if they have answered the questions in the same way.
When studying the variables and categories, the aim is to summarise the relationships between all of the variables. Thus we are looking for "synthetic" variables which sum up the information contained in a number of the original variables. This information can also be studied in terms of the categories.
In MCA the main focus is on studying categories, as they represent both variables and a group of individuals.
Here the MCA is performed on the active individuals/variables.
library(FactoMineR)
# Not yet plotting
mca <- MCA(my_tea, graph = FALSE)
print(mca) # Contains results as lists
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 10 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li" "weights of rows"
SUMMARY TABLES OF MCA
The MCA summary contains four tables: the eigenvalues of the dimensions, the results for the first 10 individuals, the results for the first 10 categories, and the squared correlations (eta2) between each categorical variable and the dimensions.
summary(mca, nb.dec = 2)
##
## Call:
## MCA(X = my_tea, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.21 0.18 0.14 0.14 0.12 0.11 0.11
## % of var. 11.27 9.54 7.34 7.12 6.55 5.97 5.61
## Cumulative % of var. 11.27 20.81 28.15 35.27 41.82 47.79 53.40
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.10 0.10 0.09 0.09 0.08 0.08 0.08
## % of var. 5.28 5.12 4.99 4.83 4.43 4.16 3.95
## Cumulative % of var. 58.67 63.79 68.79 73.61 78.04 82.20 86.15
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19
## Variance 0.07 0.06 0.05 0.05 0.03
## % of var. 3.49 3.30 2.89 2.47 1.69
## Cumulative % of var. 89.65 92.95 95.84 98.31 100.00
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | -0.40 0.25 0.05 | 0.30 0.17 0.03 | 0.45 0.49 0.06
## 2 | -0.22 0.07 0.04 | -0.16 0.04 0.02 | 0.76 1.40 0.46
## 3 | -0.09 0.01 0.00 | 0.24 0.11 0.03 | -0.33 0.25 0.05
## 4 | -0.20 0.06 0.02 | 0.34 0.21 0.06 | -0.35 0.28 0.06
## 5 | -0.26 0.10 0.08 | -0.13 0.03 0.02 | -0.04 0.00 0.00
## 6 | -0.26 0.11 0.02 | 0.60 0.66 0.11 | 0.01 0.00 0.00
## 7 | -0.28 0.12 0.11 | -0.07 0.01 0.01 | 0.21 0.10 0.06
## 8 | -0.16 0.04 0.02 | -0.05 0.01 0.00 | 0.33 0.26 0.08
## 9 | 0.41 0.26 0.10 | -0.57 0.59 0.19 | 0.40 0.38 0.09
## 10 | 0.64 0.64 0.24 | -0.50 0.46 0.14 | 0.38 0.34 0.08
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | -0.08 0.16 0.01 -1.40 | -0.35 3.33 0.12 -5.89 |
## Not.breakfast | 0.08 0.15 0.01 1.40 | 0.33 3.07 0.12 5.89 |
## evening | 0.06 0.07 0.00 0.81 | -0.16 0.50 0.01 -2.03 |
## Not.evening | -0.03 0.04 0.00 -0.81 | 0.08 0.26 0.01 2.03 |
## lunch | -0.04 0.01 0.00 -0.29 | -0.88 6.30 0.13 -6.33 |
## Not.lunch | 0.01 0.00 0.00 0.29 | 0.15 1.08 0.13 6.33 |
## dinner | 0.57 1.06 0.02 2.71 | 0.83 2.63 0.05 3.92 |
## Not.dinner | -0.04 0.08 0.02 -2.71 | -0.06 0.20 0.05 -3.92 |
## black | 0.41 1.90 0.05 4.02 | -0.06 0.05 0.00 -0.61 |
## Earl Grey | -0.22 1.44 0.09 -5.09 | -0.14 0.68 0.03 -3.22 |
## Dim.3 ctr cos2 v.test
## breakfast 0.37 4.63 0.12 6.09 |
## Not.breakfast -0.34 4.27 0.12 -6.09 |
## evening -0.61 9.11 0.19 -7.61 |
## Not.evening 0.32 4.77 0.19 7.61 |
## lunch -0.79 6.63 0.11 -5.69 |
## Not.lunch 0.14 1.14 0.11 5.69 |
## dinner -0.34 0.57 0.01 -1.60 |
## Not.dinner 0.03 0.04 0.01 1.60 |
## black 1.06 20.01 0.37 10.52 |
## Earl Grey -0.40 7.56 0.30 -9.40 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.01 0.12 0.12 |
## evening | 0.00 0.01 0.19 |
## lunch | 0.00 0.13 0.11 |
## dinner | 0.02 0.05 0.01 |
## Tea | 0.09 0.11 0.38 |
## How | 0.06 0.09 0.27 |
## sugar | 0.04 0.01 0.25 |
## how | 0.60 0.41 0.01 |
## where | 0.70 0.55 0.01 |
## price | 0.61 0.33 0.03 |
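Before moving on, one handy way to see which variables drive each dimension is FactoMineR's dimdesc() function, which lists the variables and categories most significantly associated with each dimension. A brief sketch:
# describe dimensions 1 and 2 by their most associated variables/categories
dims <- dimdesc(mca, axes = 1:2)
dims$`Dim 1`$quali     # variables linked to the first dimension
dims$`Dim 1`$category  # categories linked to the first dimension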
EIGENVALUES
The proportion of variance retained by the different dimensions (axes) can be extracted with the function get_eigenvalue. Note that the eigenvalues were already shown in the table above, but here I'm also going to show a scree plot of the eigenvalues. All of these tables and plots tell us that the first two dimensions capture only approximately 21% of the variance.
library(factoextra)
# http://www.sthda.com/english/wiki/multiple-correspondence-analysis-essentials-interpretation-and-application-to-investigate-the-associations-between-categories-of-multiple-qualitative-variables-r-software-and-data-mining
eigenvalues <- get_eigenvalue(mca)
head(round(eigenvalues, 2))
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.21 11.27 11.27
## Dim.2 0.18 9.54 20.81
## Dim.3 0.14 7.34 28.15
## Dim.4 0.14 7.12 35.27
## Dim.5 0.12 6.55 41.82
## Dim.6 0.11 5.97 47.79
# scree plot
fviz_screeplot(mca)
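If you want the variance percentages printed on the bars, fviz_screeplot accepts an addlabels argument; this is a factoextra option, shown here as an optional variation:
# scree plot with percentage labels on the bars
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 12))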
BIPLOT OF VARIABLES/CATEGORIES INCLUDING INDIVIDUALS
This biplot shows the global pattern within the data. Rows, i.e. individuals, are shown as blue points with numbers, whereas the columns (variable categories) are shown as red triangles. The distance between any two row points or column points gives a measure of their similarity or dissimilarity. Thus row/column points with a similar profile are close to each other on the factor map.
library(factoextra)
fviz_mca_biplot(mca)
BIPLOT OF VARIABLES/CATEGORIES
Here I show the biplot of the variables and their categories. We can say that the first dimension or component opposes, for example, the category "tea shop" to the category "chain store", "unpackaged" to "tea bag", and "p_upscale" to the categories "p_unknown", "p_private label", "p_branded" and "p_cheap". Thus it seems that the first component opposes at least some luxury features to more regular or ordinary features.
The second component distinguishes, for example, the category "breakfast" from "Not.breakfast", "alone" from "lemon" and "milk", "lunch" from "Not.lunch", and the category "green" from the categories "Earl Grey" and "black". Thus it seems that the second component opposes some features of timing and flavour.
plot(mca, invisible = c("ind"), habillage = "quali")
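To back up this reading of the dimensions, one can also plot the category contributions with factoextra; a sketch:
library(factoextra)
# top 10 category contributions to the first and second dimension
fviz_contrib(mca, choice = "var", axes = 1, top = 10)
fviz_contrib(mca, choice = "var", axes = 2, top = 10)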